Abstract

The coding and noncoding length sequences constructed from a complete genome
are characterised by multifractal analysis. The dimension spectrum $D_q$ and its
derivative, the 'analogous' specific heat $C_q$, are calculated for the coding and noncoding
length sequences of bacteria, where q is the moment order of the partition sum of the
sequences. From the shape of the $D_q$ and $C_q$ curves, it is seen that there exists a clear
difference between the coding/noncoding length sequences of all organisms considered
and a completely random sequence. The complexity of noncoding length sequences
is higher than that of coding length sequences for bacteria. Almost all $D_q$ curves for
coding length sequences are flat, so their multifractality is small, whereas almost all
$D_q$ curves for noncoding length sequences are multifractal-like. It is seen that the
'analogous' specific heats of noncoding length sequences of bacteria have a rich variety
of behaviour which is much more complex than that of coding length sequences. We
propose to characterise the bacteria according to the types of the $C_q$ curves of their
noncoding length sequences. This new type of classification allows a better understanding
of the relationship among bacteria at the global gene level instead of the nucleotide
sequence level.

* Corresponding author, corresponding address: School of Mathematical Science, Queensland University
of Technology, Garden Point Campus, GPO Box 2434, Brisbane, Q 4001, Australia. Tel.: +61 7 38645194,
Fax: +61 7 38642310, e-mail: yuzg@hotmail.com or z.yu@qut.edu.au

† This is the permanent corresponding address of Zu-Guo Yu.

PACS numbers: 87.10.+e, 47.53.+n

Key words: Coding/noncoding segments, length sequence, complete genome, multifrac- 
tal analysis, 'analogous' specific heat. 



1 Introduction 

The rapidly accumulating complete genome sequences of bacteria and archaea provide a
new type of information resource for understanding gene functions and evolution.

One can study DNA sequences in detail by considering the order of the four kinds
of nucleotides from which DNA is assembled, namely adenine (a), cytosine (c), guanine (g), and
thymine (t).

There has been considerable interest in the finding of long-range correlation (LRC) in
DNA sequences at this level. Li et al. found that the spectral density of a DNA sequence
containing mostly introns shows $1/f^{\beta}$ behaviour, which indicates the presence of LRC. The
correlation properties of coding and noncoding DNA sequences were also studied by Peng
et al. in their fractal landscape or DNA walk model. In this DNA walk,
the walker steps "up" if a pyrimidine (c or t) occurs at position i along the DNA
chain, and steps "down" if a purine (a or g) occurs at position i. Peng et al.
discovered that there exists LRC in noncoding DNA sequences while the coding sequences
correspond to a regular random walk. By doing a more detailed analysis,
Chatzidimitriou-Dreismann and Larhammar concluded that both coding and noncoding sequences exhibit
LRC. A subsequent work by Prabhu and Claverie also substantially corroborates these
results. If one considers more details by distinguishing c from t among the pyrimidines, and a from g
among the purines (as in two- or three-dimensional DNA walk models and related maps),
then the presence of base correlation has been found even in coding sequences. In view of
the controversy about the presence of correlation in all DNA or only in noncoding DNA,
Buldyrev et al. showed that LRC appears mainly in noncoding DNA, using all the DNA
sequences available. Alternatively, Voss, based on equal-symbol correlation, showed a
power law behaviour for the sequences studied regardless of the percentage of intron content.
Investigations based on different models seem to suggest different results, as they all look
into only a certain aspect of the entire DNA sequence.
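
The purine/pyrimidine DNA walk described above can be sketched in a few lines. This is only an illustrative sketch: the function name `dna_walk` and the choice to skip symbols other than a, c, g, t are assumptions made here, not part of the cited model.

```python
def dna_walk(seq):
    """Cumulative displacement of the purine/pyrimidine DNA walk.

    Step +1 ("up") for a pyrimidine (c or t), step -1 ("down") for a
    purine (a or g). Any other symbol is skipped (an assumption).
    """
    pyrimidines = {"c", "t"}
    purines = {"a", "g"}
    position = 0
    walk = []
    for base in seq.lower():
        if base in pyrimidines:
            position += 1
        elif base in purines:
            position -= 1
        else:
            continue  # ambiguous symbol: ignored in this sketch
        walk.append(position)
    return walk


print(dna_walk("acgtt"))
```

The returned list is the walker's position after each base; correlation analyses of the kind mentioned above are carried out on the fluctuations of this displacement.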

The avoided and under-represented strings in some bacterial complete genomes have been
discussed. A time series model of CDS in complete genomes has also been proposed.

Vieira performed a low-frequency analysis of the complete DNA of 13 microbial genomes
and found that their fractal behaviour does not always prevail through the entire chain and
their autocorrelation functions have a rich variety of behaviours including the presence of
anti-persistence.

For the importance of the numbers, sizes and ordering of genes along the chromosome,
one can refer to Part 5 of Lewin. Here one may ignore the composition of the four
kinds of bases in coding and noncoding segments and consider only the rough structure
of the complete genome or long DNA sequences. Provata and Almirantis proposed a
fractal Cantor pattern of DNA. They map coding segments to filled regions and noncoding
segments to empty regions of a random Cantor set and then calculate the fractal dimension of
the random fractal set. They found that the coding/noncoding partition in DNA sequences
of lower organisms is homogeneous-like, while in the higher eukaryotes the partition is fractal.
This result seems too coarse to distinguish bacteria, because the fractal dimensions they
obtained for bacteria are all the same.

Viewed at the level of structure, the complete genome of an organism is made up of
coding and noncoding segments. Here the length of a coding/noncoding segment means the
number of its bases. Based on the lengths of coding/noncoding segments in the complete
genome, one can get two kinds of integer sequences in the following ways:

i) Order all lengths of coding segments according to the order of coding segments in the
complete genome. This integer sequence is named the coding length sequence.

ii) Order all lengths of noncoding segments according to the order of noncoding segments
in the complete genome. This integer sequence is named the noncoding length sequence.
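
The two constructions above can be sketched as follows. The input format is an assumption made for illustration: coding segments are given as hypothetical (start, end) pairs with 1-based inclusive coordinates, they do not overlap, and the genome is treated as linear. How overlapping genes or a circular chromosome should be handled is not specified in the text, so this sketch does not address them.

```python
def length_sequences(coding_segments, genome_length):
    """Return (coding_lengths, noncoding_lengths) in genome order.

    coding_segments: iterable of (start, end), 1-based inclusive,
    assumed non-overlapping (an assumption of this sketch).
    """
    coding, noncoding = [], []
    previous_end = 0
    for start, end in sorted(coding_segments):
        gap = start - previous_end - 1       # bases between two coding segments
        if gap > 0:
            noncoding.append(gap)
        coding.append(end - start + 1)       # number of bases in the segment
        previous_end = end
    tail = genome_length - previous_end      # noncoding bases after the last gene
    if tail > 0:
        noncoding.append(tail)
    return coding, noncoding


coding, noncoding = length_sequences([(11, 40), (51, 100), (101, 130)], 150)
print(coding, noncoding)
```

Adjacent coding segments with no bases between them contribute no entry to the noncoding length sequence in this sketch, so the two sequences need not have the same number of terms.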

Yu and Anh proposed a time series model for the length sequences of DNA. After
calculating the correlation dimensions and Hurst exponents, it was found that one can get
more information from this model than from the fractal Cantor pattern. The quantification
of these correlations could give an insight into the role of the ordering of genes on the
chromosome. Through detrended fluctuation analysis (DFA) and spectral analysis,
LRC was found in these length sequences.

The correlation dimension and Hurst exponent are parameters of global analysis. Global
calculations neglect the fact that length sequences from a complete genome are highly
inhomogeneous. Multifractal analysis is a useful way to characterise the spatial
inhomogeneity of both theoretical and experimental fractal patterns. It was initially proposed to
treat turbulence data. In recent years it has been applied successfully in many different fields
including time series analysis and financial modelling. For DNA sequences,
application of the multifractal technique seems rare (we have found only Berthelsen et al.).
Recently, Yu et al. considered the multifractal property of the measure representation of a
complete genome. In this paper, we pay more attention to the multifractal characterisation
of the coding and noncoding length sequences.

Some sets of physical interest have a nonanalytic dependence of the dimension spectrum
$D_q$ on the q-moments of the partition sum of the sequences. Moreover, multifractality has
a direct analogy to the phenomenon of phase transition in condensed-matter physics.
The existence and type of phase transitions might turn out to be a worthwhile
characterisation of universality classes for the structures. The concept of phase transition
in multifractal spectra was introduced in the study of logistic maps, Julia sets and other
simple systems. Evidence of phase transition was found in the multifractal spectrum of
diffusion-limited aggregation. By following the thermodynamic formulation of
multifractal measures, where q represents an analogous temperature, Canessa applied a standard
expression for the 'analogous' specific heat and showed that its form resembles a classical
phase transition at a critical point for financial time series.

In this paper we calculate the 'analogous' specific heat of coding and noncoding length
sequences. Our motivation to apply Canessa's framework to characterise stochastic sequences
is to see whether there is a similar type of phase transition in the coding and noncoding length
sequences as in other time series. We show that, based on the shape of the $C_q$ curves and
the associated type of phase transitions, one can discuss the classification of bacteria. This new
type of classification allows a better understanding of the relationship among bacteria at the
global gene level instead of the nucleotide sequence level.



2 Multifractal analysis 



Let $T_t$, $t = 1, 2, \ldots, N$, be the length sequence of coding or noncoding segments in the
complete genome of an organism. First we define

$$F_t = \frac{T_t}{\sum_{j=1}^{N} T_j} \qquad (1)$$

to be the frequency of $T_t$. It follows that $\sum_t F_t = 1$. Now we can define a measure $\mu$ on $[0, 1[$
by $d\mu(x) = Y(x)\,dx$, where

$$Y(x) = N \times F_t, \quad \text{when } x \in \left[\frac{t-1}{N}, \frac{t}{N}\right[. \qquad (2)$$

It is easy to see that $\int_0^1 d\mu(x) = 1$ and $\mu([(t-1)/N, t/N[) = F_t$.
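
A minimal sketch of this construction, assuming the frequency of a term is its length divided by the sum of all lengths (so that the frequencies sum to 1). The function names `frequencies` and `density` are illustrative only.

```python
def frequencies(lengths):
    """Normalise a length sequence so that the frequencies sum to 1."""
    total = sum(lengths)
    return [t / total for t in lengths]


def density(x, freqs):
    """Piecewise-constant density Y(x) = N * F_t on [(t-1)/N, t/N[, for x in [0, 1[."""
    n = len(freqs)
    cell = min(int(x * n), n - 1)   # 0-based index of the cell containing x
    return n * freqs[cell]


freqs = frequencies([30, 50, 30, 90])
n = len(freqs)
# Each cell has width 1/N, so the total mass is the sum of Y * (1/N) over cells.
total_mass = sum(density((t + 0.5) / n, freqs) / n for t in range(n))
print(freqs, total_mass)
```

The mass of the t-th cell is the density times the cell width 1/N, which recovers the frequency of the t-th term; summing over cells gives total mass 1.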

The most common numerical implementations of multifractal analysis are the so-called
fixed-size box-counting algorithms. In the one-dimensional case, for a given measure $\mu$
with support $E \subset \mathbb{R}$, we consider the partition sum

$$Z_\epsilon(q) = \sum_{\mu(B) \neq 0} [\mu(B)]^q, \qquad (3)$$

$q \in \mathbb{R}$, where the sum runs over all different nonempty boxes $B$ of a given side $\epsilon$ in a grid
covering of the support $E$, that is,

$$B = [k\epsilon, (k+1)\epsilon[. \qquad (4)$$

The scaling exponent $\tau(q)$ is defined by

$$\tau(q) = \lim_{\epsilon \to 0} \frac{\log Z_\epsilon(q)}{\log \epsilon} \qquad (5)$$

and the generalized fractal dimensions of the measure are defined as

$$D_q = \tau(q)/(q-1), \quad \text{for } q \neq 1, \qquad (6)$$

and

$$D_q = \lim_{\epsilon \to 0} \frac{Z_{1,\epsilon}}{\log \epsilon}, \quad \text{for } q = 1, \qquad (7)$$

where $Z_{1,\epsilon} = \sum_{\mu(B) \neq 0} \mu(B) \log \mu(B)$. The generalized fractal dimensions are numerically
estimated through a linear regression of

$$\frac{1}{q-1} \log Z_\epsilon(q)$$

against $\log \epsilon$ for $q \neq 1$, and similarly through a linear regression of $Z_{1,\epsilon}$ against $\log \epsilon$ for
$q = 1$. $D_1$ is called the information dimension and $D_2$ the correlation dimension. The $D_q$ for
positive values of q give relevance to the regions where the measure is large, i.e., to the
coding or noncoding segments which are relatively long. The $D_q$ for negative values of q
deal with the structure and the properties of the most rarefied regions of the measure, i.e.,
with the segments which are relatively short.
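
The regression estimate described above might be sketched as below. Two simplifications are assumptions of this sketch rather than details given in the text: each box is taken to span a whole number of consecutive cells of width 1/N (so a box measure is a sum of frequencies), and the slope is obtained by ordinary least squares. The names `box_masses`, `regression_slope` and `generalized_dimension` are illustrative.

```python
import math


def box_masses(freqs, cells_per_box):
    """Measure of each box when a box spans `cells_per_box` consecutive cells."""
    return [sum(freqs[i:i + cells_per_box])
            for i in range(0, len(freqs), cells_per_box)]


def regression_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def generalized_dimension(freqs, q, box_widths):
    """Estimate D_q by regression against log(epsilon), epsilon = width / N."""
    n = len(freqs)
    log_eps, ys = [], []
    for width in box_widths:
        masses = [m for m in box_masses(freqs, width) if m > 0]  # nonempty boxes
        log_eps.append(math.log(width / n))
        if q == 1:
            ys.append(sum(m * math.log(m) for m in masses))      # Z_{1,eps}
        else:
            z = sum(m ** q for m in masses)                      # Z_eps(q)
            ys.append(math.log(z) / (q - 1))
    return regression_slope(log_eps, ys)


uniform = [1 / 64] * 64
print(generalized_dimension(uniform, 2, [1, 2, 4, 8, 16]))  # close to 1
```

For a uniform set of frequencies the estimate is close to 1 for every q, which gives a convenient sanity check before applying the procedure to real length sequences. The choice of box widths used for the fit is left to the user here.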

By following the thermodynamic formulation of multifractal measures, Canessa
derived an expression for the 'analogous' specific heat as

$$C_q = -\frac{\partial^2 \tau(q)}{\partial q^2} \approx 2\tau(q) - \tau(q+1) - \tau(q-1). \qquad (8)$$

He showed that the form of $C_q$ resembles a classical phase transition at a critical point for
financial time series. In the following we calculate the 'analogous' specific heat of coding and
noncoding length sequences for the first time. The types of phase transitions are helpful for
discussing the classification of bacteria.
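
The second-difference approximation of the 'analogous' specific heat, $C_q \approx 2\tau(q) - \tau(q+1) - \tau(q-1)$, is straightforward to evaluate once $\tau(q)$ is available. The sketch below takes $\tau$ as a callable. The two example exponents are illustrations chosen here, not data from this paper: $\tau(q) = q - 1$ for a uniform measure, and the standard closed form for a binomial multiplicative measure with weight p.

```python
import math


def specific_heat(tau, q):
    """Second-difference approximation: 2*tau(q) - tau(q+1) - tau(q-1)."""
    return 2 * tau(q) - tau(q + 1) - tau(q - 1)


def tau_uniform(q):
    """Scaling exponent of a uniform measure on [0, 1[."""
    return q - 1


def tau_binomial(q, p=0.3):
    """Scaling exponent of a binomial multiplicative measure (illustrative)."""
    return -math.log2(p ** q + (1 - p) ** q)


qs = [k / 2 for k in range(-10, 11)]                     # q from -5 to 5
curve = [specific_heat(tau_binomial, q) for q in qs]     # a sampled C_q curve
print(specific_heat(tau_uniform, 3.0), max(curve))
```

Because the approximation uses unit steps in q, a curve sampled on a finer q grid still needs $\tau$ at q + 1 and q - 1 for every sampled point.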



3 Data and results 

More than 31 bacterial complete genomes are now available in public databases. There are 
five Archaebacteria: Archaeoglobus fulgidus (aful), Pyrococcus abyssi (pabyssi), Methanococ- 
cus jannaschii (mjan), Aeropyrum pernix (aero) and Methanobacterium thermoautotroph- 
icum (mthe); five Gram-positive Eubacteria: Mycobacterium tuberculosis (mtub), Mycoplasma 
pneumoniae (mpneu), Mycoplasma genitalium (mgen), Ureaplasma urealyticum (uure), and
Bacillus subtilis (bsub). The others are Gram-negative Eubacteria, which consist of two 
Hyperthermophilic bacteria: Aquifex aeolicus (aquae) and Thermotoga maritima (tmar); 
three Chlamydia: Chlamydia trachomatis serovar (ctra), Chlamydia muridarum (ctraM),
and Chlamydia pneumoniae (cpneu); two Spirochaete: Borrelia burgdorferi (bbur) and Tre- 
ponema pallidum (tpal); one Cyanobacterium: Synechocystis sp. PCC6803 (synecho); and 
thirteen Proteobacteria. The thirteen Proteobacteria are divided into four subdivisions, 
which are alpha subdivision: Rhizobium sp. NGR234 (pNGR234) and Rickettsia prowazekii 
(rpxx); gamma subdivision: Escherichia coli (ecoli), Haemophilus influenzae (hinf), Xylella 
fastidiosa (xfas), Vibrio cholerae (vchol), Pseudomonas aeruginosa (paer) and Buchnera sp. 
APS (buch); beta subdivision: Neisseria meningitidis MC58 (nmen) and Neisseria menin- 
gitidis Z2491 (nmenA); epsilon subdivision: Helicobacter pylori J99 (hpyl99), Helicobacter 
pylori 26695 (hpyl) and Campylobacter jejuni (cjej). 

First we determined the lengths of coding and noncoding segments in the complete
genomes of the above bacteria and obtained the coding and noncoding length sequences
of these organisms. For example, we give the coding and noncoding length sequences of
Pseudomonas aeruginosa (paer) in Figure 1.

Then we calculated the dimension spectra $D_q$ and 'analogous' specific heat $C_q$ of the
coding and noncoding length sequences of all the above bacteria according to the methods
given in Section 2. In order to show the difference between coding and noncoding length
sequences, we give the $C_q$ curves of length sequences of all the above bacteria in Figure 2
(for 19 bacteria) and Figure 3 (for another 12 bacteria).

The hill behaviour of the dimension spectrum $D_q$ for q < 0 is a well known fact when
using the box-counting method. In Figures 4 and 5, we present $D_q$ of the coding and
noncoding length sequences of all bacteria selected within the range q > 0.

4 Discussion and conclusions 

If a length sequence is completely random, then our measure definition yields a uniform
measure ($D_q = 1$, $C_q = 0$).
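
As a quick check of these two values, consider the idealised case in which every box of side $\epsilon$ in a grid covering $[0, 1[$ carries measure exactly $\epsilon$, so that there are $\epsilon^{-1}$ nonempty boxes. Under that assumption, the partition sum, the scaling exponent, the generalized dimensions and the second-difference form of the 'analogous' specific heat reduce to:

```latex
\begin{aligned}
Z_\epsilon(q) &= \sum_{\mu(B)\neq 0} [\mu(B)]^{q}
               = \epsilon^{-1}\,\epsilon^{q} = \epsilon^{\,q-1},\\
\tau(q) &= \lim_{\epsilon\to 0}\frac{\log Z_\epsilon(q)}{\log \epsilon} = q-1,\\
D_q &= \frac{\tau(q)}{q-1} = 1 \qquad (q \neq 1),\\
C_q &\approx 2\tau(q)-\tau(q+1)-\tau(q-1)
     = 2(q-1) - q - (q-2) = 0 .
\end{aligned}
```

Any departure of the computed curves from $D_q = 1$ and $C_q = 0$ therefore measures how far a length sequence is from this uniform reference.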

From the curves of $D_q$ and $C_q$, it is seen that there exists a clear difference between
the coding/noncoding length sequences of all organisms considered here and the completely
random sequence. Hence we can conclude that complete genomes are not random sequences.
However, the $D_q$ values of coding length sequences are closer to 1 than those of noncoding length
sequences. In other words, noncoding length sequences are further away from a completely
random sequence than coding length sequences. This property of the length sequences is
the same as that of DNA sequences.

We also found that for each bacterium selected, the $D_q$ values for q > 0 of a noncoding
length sequence are smaller than those of a coding length sequence, but for q < 0 the
situation is reversed. It is well known that the dimension is a measure of complexity. Here
the complexity of noncoding length sequences is higher than that of coding length sequences
for bacteria.

From Figures 4 and 5, almost all $D_q$ curves for coding length sequences are flat, so their
multifractality is not pronounced. On the other hand, almost all $D_q$ curves for noncoding
length sequences are multifractal-like.

In our previous paper, we counted all substrings of fixed length appearing
in the complete genome and gave a measure representation of the complete genome. We
found that the $C_q$ curves of all bacteria we selected are single-peaked. Hence
this type of phase transition of the measure representation is not useful for classification of
bacteria. On the other hand, from Figures 2 and 3 one can see that the 'analogous' specific
heats of noncoding length sequences of bacteria have a rich variety of behaviours which is
much more complex than that of coding length sequences. Some have only one main
peak. In this class, some $C_q$ curves display a shoulder to the right of the main peak, some
display a shoulder to the left of the main peak, and some have no shoulder, which resembles
a classical (first-order) phase transition at a critical point. In another class, the $C_q$ curves
display a balanced double peak. This provides a useful tool for classification of bacteria
according to the types of 'analogous' specific heats of the noncoding length sequences. The
relevant finding here is that noncoding length sequences display higher $C_q$ peak heights and
clearer double-peaked structures than coding length sequences. This reveals different types of
long-range correlations between the two classes of sequences. This new type of classification
allows a better understanding of the relationship among bacteria at the global gene level
instead of the nucleotide sequence level. It can be useful to distinguish between sequence curves
such as those given in the example of Figure 1.
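
The single-peak versus double-peak distinction drawn above is made by inspecting the plotted curves. Purely as an illustration of how a first pass could be automated, the hypothetical helper below counts local maxima in a sampled $C_q$ curve. The function `count_peaks` and its `min_height` threshold are assumptions of this sketch, and it does not detect shoulders.

```python
def count_peaks(values, min_height=0.0):
    """Count interior local maxima above `min_height` in a sampled curve.

    A flat-topped peak is counted once (at its first sample).
    """
    peaks = 0
    for i in range(1, len(values) - 1):
        rises = values[i] > values[i - 1]
        does_not_rise_after = values[i] >= values[i + 1]
        if rises and does_not_rise_after and values[i] > min_height:
            peaks += 1
    return peaks


single = [0.0, 1.0, 3.0, 1.0, 0.0]
double = [0.0, 2.0, 1.0, 2.0, 0.0]
print(count_peaks(single), count_peaks(double))
```

A threshold such as `min_height` would have to be chosen by the user to ignore small numerical ripples, and any automated count should be checked against the plotted curves.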

To conclude, multifractal analysis provides a simple yet powerful method to amplify
the difference between a DNA length sequence and a random sequence. In particular, the
multifractal characterisation given by the 'analogous' specific heat allows DNA
length sequences to be distinguished in more detail.

Figure 1: The coding and noncoding length sequences of Pseudomonas aeruginosa.

Figure 2: $C_q$ curves of coding and noncoding length sequences of 19 Bacteria.

Figure 3: $C_q$ curves of coding and noncoding length sequences of another 12 Bacteria.

Figure 4: $D_q$ curves of coding and noncoding length sequences of 19 Bacteria.